Factors governing fibrillogenesis of polypeptide chains 
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Using lattice models we explore the factors that determine the tendencies of polypeptide chains 
to aggregate by exhaustively sampling the sequence and conformational space. The morphologies of 
the fibril- like structures and the time scales (r/ib) for their formation depend on a balance between 
hydrophobic and coulomb interactions. The extent of population of an ensemble of N* structures, 
which are fibril-prone structures in the spectrum of conformations of an isolated protein, is the major 
determinant of r/^. This observation is used to determine the aggregating sequences by exhaustively 
exploring the sequence space, thus providing a basis for genome wide search of fragments that are 
aggregation prone. 

PACS numbers: 87.15.A, 87.14. E 

Proteins that are unrelated by sequence or structure aggregate to form amyloid-like fibrils with a characteristic 
cross /3-structures [I] (a). The observation that almost any protein could form fibrils seemed to imply that fibril 
rates can be predicted solely based on sequence composition and the propensity to adopt global secondary structure. 
Such a conclusion has limited validity because it does not account for fluctuations that populate aggregation-prone 
structures. Despite the common structural characteristics of amyloid fibrils pQ the factors that determine the fibril 
formation tendencies are not understood. 

Experiments on fibril formation times (r/^) have been rationalized using global factors such as the hydrophobicity of 
side chains 0(a), net charge [2](b,c), patterns of polar and non-polar residues 0(d), frustration in secondary structure 
elements [2](e,f), and aromatic interactions [2(g). However, the inability to sample the sequences and conformational 
spaces exhaustively [3] has prevented deciphering plausible general principles that govern protein aggregation. Here, 
we obtain a quantitative correlation between intrinsic properties of polypeptide sequences and their fibril growth 
rates using lattice models, which have given remarkable insights into the general principles of protein folding and 
aggregation [4]. Using a modification of the model in [5 we explore the sequence-dependent variations of Tfn on 
the nature of conformations explored by the monomer. We highlight the role of aggregation-prone ensemble of N* 
structures [6] in the folding landscape of the monomer in determining Tfn, and the propensity of sequences to form 
fibrils. 

Lattice model. We use a lattice model [5] in which each chain consists of M connected beads that are confined to 
the vertices of a cube. The simulations are done using N identical chains with M = 8. The peptide sequence which 
is used to illustrate the roles of electrostatic and hydrophobic interaction is +HHPPHH- (Fig. [T]), where H, P, + and 
- are hydrophobic, polar, positively charged and negatively charged beads respectively [5]. 

The energy of N chains is [5J E = Y,T<j E si{i)si{j)K r ij ~ a ) + El<i Y^j E si{i)sm{j)K r ij ~ a ), where nj is 

the distance between residues i and j, a is a lattice spacing, sm(i) indicates the type of residue i from ra-th peptide, 
and 6(0) — 1 and zero, otherwise. The first and second terms represent intrapeptide and interpeptide interactions, 
respectively. 

The propensity of polar and charged residues to be "solvated" is mimicked using Ep a =-0.2 (in the units of hydrogen 
bond energy e#), where a= P,+,or -. To assess the importance of electrostatic and hydrophobic interactions, we vary 

either in the interval — 1.4 < E^ < or Ehh between -1 and 0. If E^ is varied, we set Ehh — — 1? while if 

Ehh is varied, then E-\ = —1.4. We used = E— = — E-\ /2 and all other contact interactions have E a p = 
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FIG. 1: (a) Spectrum of energies and low energy structures of the monomer sequence +HHPPHH-. H, P, + and - are in green, 

yellow, blue, and red, respectively. We set Ehh = — 1 and = 1.4. There are 1831 possible conformations that are spread 

among 17 possible energy values. The conformations in the first excited state represent the ensemble of N* structures and the 
N* conformation that coincides with the peptide state in the fibril (see Fig. [2^l) is enclosed in a box. (b) The probability P/v* 

of populating the structure in the box in (a) as a function of T for E-\ = 0, —0.3,0.6, —1 and -1.4 keeping Ehh = — 1. The 

arrow indicates T*, where P N * = Pn* x . Dependence of P™ x on P + _ for Ehh = — 1 (c), and on Ehh for E+- = -1.4 (d). 



0.2. 

Monomer spectra depends on E-\ and Ehh* The spectrum of energy states of the monomer for a given 

sequence is determined by exact enumeration of all possible conformations (Fig. [TJ. For all sets of contact energies 

the native state (NS) of the monomer is compactflowest energy conformation in Fig.[l^). For E^ < 0, the ensemble 

of N* structures are the first excited state (Fig. [Tp). However, if E + _ = 0, the ensemble of N* structures are part of 
the 19- fold degenerate states in the second excited state (see Supplementary Information (SI) Fig. 1). 

The population of the putative fibril-prone conformation in the monomeric state is P/v* = exp(— Ejy*)/Z, where 
Z, the partition function is obtained by exact enumeration. Fig. [TJd, shows the temperature dependence of P/v* 
for various values of (£"+-) interaction, with Ehh = —1 and other contact energies constant. Depending on £"+_, 
the maximum value of Pn* varies from 2% < Pff? x < 12% (Fig. [J). P^ ax decreases to a lesser extent as the 
hydrophobic interaction grows (Fig. [l]i). Here we consider only Ehh < —0.4 because the fibril-like structure is not 
the lowest-energy when Ehh > —0.4 . 

Morphology of lowest-energy structures of multi-chain systems depends on sequences. When multiple 
chains are present in the unit cell, aggregation is readily observed, and in due course they lead to ordered structures. 
We used the Monte Carlo (MC) [5] annealing protocol, which allows for an exhaustive conformational search, to find 

the lowest energy conformation. For non-zero values of E^ the chains adopt an antiparallel arrangement in the 

ordered protofilament, which ensures that the number of salt-bridge and hydrophobic contacts are maximized (Fig. 
[2^i). If E+- = then the lowest energy fibril structure has a vastly different architecture even though they are 
assembled from N* (Fig. |2b). The structure in Fig. |2b, in which a pair of N* conformations are stacked by flipping 
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FIG. 2: (a) The lowest energy fibril structure for E-\ = — 1.4 and Ehh = — 1. (b) Same as in (a) but with E+- = 0. (c) 

Double layer structure for Ehh — —0.4 but with E-\ — —1.4. (d) For E-\ — —1.4 and Ehh — —0.3 the fibril structure 

is entirely altered, (e) Temperature dependence of Tfib for E+- = —1.4 (circles) and E^ = —0.6 (triangles). N = 6 and 

Ehh — — 1. Arrows show the temperatures at which the fibril formation is fastest. 

one with respect to the other is rendered stable by maximizing the number of +P and -P contacts. We now set 

= — 1.4 and vary Ehh- For Ehh < —0.4, the fibril conformation adopts the same shape as that shown in Fig. 

[2^, but for E HH = -0.4 the energetically more favorable double-layer structure emerges (Fig. [2^). If Ehh > —0.3, 
then the lowest-energy conformation ceases to have the fibril-like shape (Fig. [2jl). The close packed heterogenous 
structure is stitched together by a mixture of the NS conformation and one of the second excited conformations. Even 
for this simple model a variety of lowest-energy structures of oligomers and protofilaments with different morphologies 
emerge, depending on a subtle balance between electrostatic and hydrophobic interactions. 

Dependence of Tfib on E-\ an d Ehh* Simulations were performed by enclosing N chains in a box with periodic 

boundary conditions and move sets described in ref. [5 j. The effect of finite size is discussed in SI, Fig. 2. The fibril 
formation time Tfib is defined as an average of first passage times needed to reach the fibril state with the lowest 
energy starting from initial random conformations. For a given value of T, we generated 50-100 MC trajectories to 
compute Tfib. We measure time in units of a Monte Carlo step (MCS), which is a combination of local and global 
moves. 

We performed an exhaustive study of the dependence of Tfib on the number of chains in the simulation box, N 
(SI, Fig. 2). For highly favorable interaction between the terminal charged residues, E + _ — —1.4, Tfib scales linearly 
with the size of the system (SI, Fig. 2a), while for less favorable interactions, £"+_ = —0.6 and -0.8, ln(r/^) scales 
linearly with the size of the system (SI, Fig. 2b). The temperature dependence of Tfib displays a U-shape (Fig. [2^) 
and the fastest assembly occurs at T m ^ n , which roughly coincides with the temperature, T*, where P/v* reaches 
maximum (Fig. [TJd). To probe the correlation between Tfib and E+_ and Ehh we performed simulations at T min . 
The dependence of Tfib on E+_ can be fit using Tfib ~ exp[— c(— E+_) a ] where a « 0.6 and the constant c « 7.12 and 
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FIG. 3: (a) Dependence of r/ib on E-\ for N = 6 (circles) and A = 10 (triangles) with Ehh = — 1- The solid curves are fits to 

2/ = c + c(-x) a , where a « 0.59. c = 21.32 and c = -7.12 and c = 25.14 and c = -9.23 for N = 6 and 10, respectively, (b) 

Dependence of r/i& on Ehh with = — 1.4 hold constant for A = 6 (solid circle) and A = 10 (solid triangles). Lines are 

fits y = 19.17 + 7.97x and 2/ = 22.69 + 8.56a: for N = 6 and 10, respectively. For N = 6 the first point Ehh = — 1 is excluded 
from fitting, (c) Dependence of r/i& on PJ^* X for A = 6 and 10. Symbols are the same as in (a) and (b) 77^ is measured in 
in %. The correlation coefficient for all fits R ~ 0.98. 



MCS and P^ x 



9.23 for TV = 6 and 10 respectively (Fig. [3^1). Thus, variation of E+- drastically changes not only the morphology 
of the ordered protofilament (Fig. [2|, but also t^&. As the strength of the charge interaction between the terminal 

beads increases, the faster is the fibril formation process. Interestingly, the fibril formation rate at = is about 

four orders of magnitude slower than that at E + _ = —1.4. The propensity to fibril assembly strongly depends on the 
charge states of the polypeptide sequences pQ (b) . 

By fixing = —1.4 we calculated the dependence of on the hydrophobic interaction (Fig.^), which may be 

approximated using r/^ ~ exp(cE##). Here constant c ~ 7.97 and 8.56 for N = 6 and 10, respectively. For N = 10, 
a change in hydrophobicity of AEhh = 0.6, leads to self-assembly rates that are more than two orders of magnitude. 
Thus, enhancement of hydrophobic interactions speeds up fibril formation rates [TJE]. 

Fibril formation rates depend on P/v* . A plot of the data in Fig. |3^i and ^jp as a function of P^ ax (Fig. ^) 
yields the surprising relation 

T fib = T% b eM-cP& ax ), (1) 

where the prefactor r° fih w 1.014 x 10 10 MCS and 3.981 xlO 11 MCS, and c w 0.9 and 1.0, for A 7 " = 6 and 10, 
respectively. Eq. [I] is also valid for three other degenarate conformations in the N* ensemble, which are structurally 
similar to the one enclosed in the box in Fig. la. There are a few implications of the central result given in Eq. [I] (i) 
The sequence-dependent spectrum of the monomer is a harbinger of fibril formation. In proteins there are multiple 
N* conformations corresponding to distinct free energy basins of attraction [6](b). Aggregation from each of the 
structures in the various basins of attraction could lead to fibrils with different morphologies (polymorphism) that 
cannot be captured using lattice models, (ii) Enhancement of P/v* either by mutation or chemical cross linking 
should increase fibril formation rates. Indeed, a recent experiment [9] showed that the aggregation rate of A/3i_4o- 
lactam[D23-K28], in which the residues D23 and K28 are chemically constrained by a lactam bridge, is nearly a 1000 
times greater than in the wild-type. Since the salt bridge constraint increases the population of the N* conformation 
in the monomeric state [10], it follows from Eq. [I] Tfn, should decrease, (hi) Since Pn*(T) depends on the spectrum 
of the precise sequence for a given set of external conditions, it follows that the entire free energy landscape of the 
monomer [6](b) and not merely the sequence composition as ascertained else where [1(b), should be considered in the 
predictions of the amyloidogenic tendencies, (iv) Eq. fl] is suggestive of a fluctuation-driven nucleation mechanism 
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with a complicated temperature dependence, (v) Finally, as a negative control, plots of ln(r/^) as a function of P™ ax 5 
where C represents a conformation from the second or the third excited state (SI, Fig. 3) show that Eq. [I] does not 
hold for these structures. 

Sequence space scanning. We use Eq. [I]to determine the amylome [TT] , the universe of sequences in the lattice 
model, that can form fibrils. We posit that aggregation prone sequences are those with a unique native state with a 
maximum in P N *(T) in the interval 1.0 < T*/T F < 1.25. If T F = 300K, which is physically reasonable, T* = 375K 
if ^ = 1.25. Thus, for values of ^ > 1.25 T* would be far too high to be physically relevant. Our conclusions 
will not change by increasing ^- or alternatively by choosing a reasonable threshold value for Pat*. Out of the 
65,536 sequences only 217 satisfy these criteria (see Supplementary Information (SI) for details). The sequence space 
exploration shows that there is a high degree of correlation between the positions of charged and hydrophobic residues 
leading to a limited number of aggregation prone sequences with +HHPPHH- being an example. In addition, there 
are substantial variations in T*/Tp for sequences with identical sequence composition, which reinforces the recent 
finding [TT] that context in which charged and hydrophobic residues are found is important in the tendency to form 
amyloid-like fibrils. Our study also provides a basis for genome wide search for consensus sequences with propensity 
to aggregate. 

The work was supported by the Ministry of Science and Informatics in Poland (grant No 202-204-234), grants 
NSC 96-291 1-M 001-003-MY3 & AS-95-TP-A07, National Center for Theoretical Sciences in Taiwan, and NIH Grant 
R01GM076688-05. 
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Supplementary Information: Determination of factors governing fibrillogenesis of polypeptide chains using 

lattice models 

For the lattice model with M beads and four types of residues (H, P, -f , -), there are 4 M sequences. Out of the 65536 

(for M = 8) sequences 5950 have a unique ground state. Among the 65,536 sequences roughly half are mirror images 

of each other. Although, in reality these would be distinct sequences, in the lattice model they would yield identical 

thermodynamic properties. Since these sequences are redundant, we do not include them in the statistical analysis. 

To determine the amylome [1 (space of sequences in the lattice model that are amyloidogenic) we first determined 

the folding temperature for the 5950 sequences using the condition PNs(T f ) = 0.5, where P NS {T f ) = e~^ ENS /Z, 

v 

where f3f = l/fc^T/, Ens is the energy of the ground state, and Z = ^^e~^ Ei ; Z can be exactly evaluated for each 

1=1 

sequence because for M = 8 the number of conformations, that satisfy self- avoidance is only 1831. The energies 
E{ for all the conformations are computed using the chosen interaction energies between the near neighbors beads in 
the cubic lattice [2J. 

Because of the key role that the hairpin-like structure (N*) plays in the formation of fibril-like structures we 
analyzed, the 5950 sequences with unique ground state (NS) to determine their properties. We determined the 
temperature T*, where the probability of finding the chain in the N* conformation P/v*(T), is a maximum. We 
classify those sequences, which satisfy the condition, 1.0 < T*/Xy < 1.25 as amyloidogenic. If T* exceeds 1.25Tj the 
probability of accessing N* is negligible, and such sequences are unlikely to aggregate in finite time scale. Even at 
T* = 1.251/, many sequences have the maximum in P/v*(T), P^* ax < 1%. In our lattice model there are only 217 
such sequences, i.e. only 217 (0.66%) sequences that are aggregation prone. In what follows, we analyze a number of 
properties of the 217 sequences in order to provide insights into the tendency of natural sequences to be amyloidogenic. 

4 

Sequence Entropy. The sequence entropy for the 217 sequences is calculated using Sk = — ^] Pki ln(P^), where 

i=i 

Pki is the probability of finding the residue of type i in position k. If Pki = 1/4 (uniform probability) for all i then 
S k = m4 = 1.386 independent of k. We find that S k = 1.106,1.260,1.329,1.286,1.336,1.318,1.212 and 1.064 for 
k = 1, 2, 8 respectively. The terminal positions have a slightly lower entropy than the positions in the middle where 
the hairpin-like bend is formed. Because there is no position that is strongly conserved we surmise that sequence 
alone cannot be a good predictor of the tendency of a sequence to aggregate. 

Dependence of P^* 00 on A(= [Ens — En*]/Ens)* There is striking anit-correlation between P^^ and A 
(Fig. [7]), which establishes that in this model the extent of population of the N* conformation is the major determine 
of the propensity to aggregate. Interestingly for foldable sequences the Z-score (related to A in realistic models) is 
large. These considerations explain how natural sequences may have evolved to maximize Z-score so that unneeded 
aggregation is avoided. 

Sequence composition. In the 217 sequences, there are 54 different sequence compositions. We find there are 
different T*/T/ values for identical composition, which implies that sequence composition alone cannot be a good 
predictor of the propensity to aggregate pQ. 



Correlation between the type of beads at different positions is significant. There is a high degree of 
correlation between the nature of beads at various positions (see Tables: IpTI). For example, oppositely charged 
residues are most likely to be found at positions 1 and 8 (Table: |l|. Similarly, if a H bead is in position 2, then with a 
unit probability, a H bea d is found in position 7 (Table: [ill). A high preference (0.92) is found for H beads to occupy 
positions 3 and 6 (Table: |TTT| ). The correlation between the type of beads is not relevant at positions 4 and 5, which 
are the hairpin bend positions (Table. |IV[ ). Based on the results in Tables: T pV we find that +HHPPHH- is one of 
the sequences with high probability to aggregate (substantial P^^). This sequence has N* as the first excited state, 
T*/T f = 1.169 and A = [E NS - E N *]/E N s = 0.105. We also find other sequences with low T*/T f values. Three 
such sequences, which can form ordered fibrils such as the ones shown in Fig. 2a are ++HPPH- -, +-HPPH+- and 
+H-PP+H- with identical A(=0.10) and T*/T / (=1.21). 



[1] L. Goldschmidt, P. K. Teng, R. Riek, and D. Eisenberg Proc. Natl. Acad. Sci. (USA) 107, 3487 (2010). 
[2] M. S. Li, D. K. Klimov, J. E. Straub, and D. Thirumalai, J. Chem. Phys. 129, 175101 (2008). 



Jirlri 



-2.0 



(a) 



-2.2 



-2.4 



FIG. 4: (a) Spectrum of energies and associated low energy structures of the monomer sequence +HHPPHH-. H, P, + and - are 

in green, yellow, blue, and red, respectively. We set Ehh = — 1 and E-\ = 0. There are a total 1831 possible conformations 

that are spread among 17 possible energy values. The N* structure enclosed in the black box coincides with the peptide state 
in the fibril. 



TABLE I: Bead correlation between positions 1 and 8 



Residue in position- 1 


Probability of finding the residues below in position - 8 


+ 




H 


p 


+ 


0.0 


0.861 


0.0 


0.139 




0.861 


0.0 


0.0 


0.139 


H 


0.0 


0.0 


1.0 


0.0 


P 


0.313 


0.313 


0.0 


0.374 



TABLE II: Bead correlation between positions 2 and 7 



Residue in position-2 


Probability of finding the residues below in position - 7 


+ 




H 


p 


+ 


0.0 


0.650 


0.0 


0.350 




0.650 


0.0 


0.0 


0.350 


H 


0.0 


0.0 


1.0 


0.0 


P 


0.389 


0.389 


0.0 


0.222 



TABLE III: Bead correlation between positions 3 and 6 



Residue in position-3 


Probability of finding the residues below in position - 6 


+ 




H 


P 


+ 


0.0 


0.817 


0.017 


0.166 




0.817 


0.0 


0.017 


0.166 


H 


0.027 


0.027 


0.920 


0.266 


P 


0.380 


0.380 


0.040 


0.200 
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FIG. 5: The dependence of 77^ on N for Ehh — — 1. (a) E-\ = —1.4. The linear dependence holds for N < 12 and iV > 12 

but with different slopes. The crossover is related to the single and double layer structures of the fibril-like conformations. 

(b) E-\ = —0.6 and -0.8. Exponential behavior is observed. Results are averaged over 50 -100 MC trajectories, (c) Average 

energy per monomer as a function 1/N for Ehh = — 1 and E-\ = —1.4. The ^/-intercept, {E)/N — —13.17 as 1/N — >• 0. 
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FIG. 6: Dependence of r/i& on p™ ax for iV = 6 and 10. The plots are for conformation in the (a) second excited state and (b) 
third excited state shown in main text, Fig. la. Note that the structures have negligible population. 



TABLE IV: Bead correlation between positions 4 and 5 



Residue in position-4 


Probability of finding the residues below in position - 5 


+ 




H 


P 


+ 


0.049 


0.124 


0.444 


0.383 




0.124 


0.049 


0.444 


0.383 


H 


0.336 


0.336 


0.085 


0.243 


P 


0.263 


0.263 


0.22 


0.254 
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FIG. 7: Anti-correlation of \n(P^ x ) on A(= [E NS - E N *]/E NS ). The straight line ln(P^) = -7.15A + 2.69 is a fit to the 
data and has a correlation of 0.9. 



